In [72]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D
from sklearn import svm, grid_search, metrics, preprocessing, linear_model, cross_validation
from sklearn.learning_curve import validation_curve, learning_curve
%matplotlib inline
## Utility functions :
def plot_PCA(X, y, classes, azimuth, elevation):
    """ X: numpy array containing the data,
        y: numpy array containing the labels,
        classes: a list containing the label names,
        azimuth, elevation: float numbers """
    nClasses = len(classes)
    fig = plt.figure(1, figsize=(10, 8))
    plt.clf()
    ax = Axes3D(fig, axisbg='white', azim=azimuth, elev=elevation)
    pca = PCA(n_components=3)
    X_redux = pca.fit_transform(X)
    print(pca.explained_variance_ratio_)
    print("Variability of the compressed dataset %.4f" % sum(pca.explained_variance_ratio_))
    cmap = plt.cm.Paired
    colors = np.linspace(0, 1, nClasses)
    for n in range(nClasses):
        index = np.ravel(y == (n + 1))  # labels are 1-indexed
        ax.scatter(X_redux[index, 0], X_redux[index, 1], X_redux[index, 2],
                   c=cmap(colors[n]), label=classes[n])
    ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
    ax.w_xaxis.set_ticklabels([])
    ax.w_yaxis.set_ticklabels([])
    ax.w_zaxis.set_ticklabels([])
    plt.title("Data plotted on the first three principal components", y=1.03)  # y prevents title overlap with the axes
    plt.show()
def plot_validation_scores(train_scores, test_scores, param_range):
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.title("Validation Curve")
    plt.xlabel("$C$")
    plt.ylabel("F1 Score")
    #plt.ylim(0.0, 1.1)
    plt.semilogx(param_range, train_scores_mean, label="Training score",
                 color="r", linestyle="--")
    plt.fill_between(param_range, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.2, color="r")
    plt.semilogx(param_range, test_scores_mean, label="Cross-validation score",
                 color="g")
    plt.fill_between(param_range, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.2, color="g")
    plt.legend(loc='lower right')
    plt.show()
def flatten(somelist):
    return [item for sublist in somelist for item in sublist]
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    from the sklearn examples:
    http://scikit-learn.org/stable/auto_examples/plot_learning_curve.html#example-plot-learning-curve-py
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc='lower right')
    plt.show()
We try to recognize the activities of 30 individuals using sensor data gathered from their smartphones. Human activity recognition can be used to measure the physical activity of users of a smartphone or of a wearable such as a smartwatch. This may be used to compute how many calories one burns during the day, and to display advice if the user does not get enough exercise to remain healthy.
We use the Human Activity Recognition Using Smartphones Dataset (Version 1.0), which can be found in the UCI Machine Learning Repository.
The data are experimental: 30 volunteers (19-48 years old) performed six activities (walking, walking upstairs, walking downstairs, sitting, standing, laying) while wearing a smartphone (Samsung Galaxy S II) on the waist. As the smartphone was worn on the waist, there is no orientation issue, as there would be if the smartphone were simply carried in a pocket.
Signals come from the accelerometer and the gyroscope of the smartphone (captured at a constant rate of 50 Hz). They were filtered to remove noise and then sampled in fixed-width sliding windows of 2.56 s with 50% overlap (128 readings/window). The acceleration signal was separated into body and gravity components (using another filter). Body acceleration (from the accelerometer) and angular velocity (from the gyroscope) were differentiated in time to obtain Jerk signals. The magnitude of the three-dimensional signals was computed using the Euclidean norm, and the Fast Fourier Transform was used to obtain frequency-domain signals.
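To make the preprocessing concrete, here is a minimal numpy sketch of some of the derived quantities described above, assuming a hypothetical window array of shape (128, 3) holding one body-acceleration window sampled at 50 Hz (an illustration, not the dataset's actual pipeline):
fs = 50.0                                       # sampling rate (Hz)
window = np.random.randn(128, 3)                # placeholder body-acceleration window
jerk = np.diff(window, axis=0) * fs             # time derivative -> Jerk signal
magnitude = np.linalg.norm(window, axis=1)      # Euclidean norm of the 3D signal
spectrum = np.abs(np.fft.rfft(window, axis=0))  # frequency-domain signal (FFT)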
In the end, we get a set of 17 signals, and for each of them several features were computed:
mean: Mean value
std: Standard deviation
mad: Median absolute deviation
max: Largest value in array
min: Smallest value in array
sma: Signal magnitude area
energy: Energy measure. Sum of the squares divided by the number of values.
iqr: Interquartile range
entropy: Signal entropy
arCoeff: Autoregression coefficients with Burg order equal to 4
correlation: correlation coefficient between two signals
maxInds: index of the frequency component with largest magnitude
meanFreq: Weighted average of the frequency components to obtain a mean frequency
skewness: skewness of the frequency domain signal
kurtosis: kurtosis of the frequency domain signal
bandsEnergy: Energy of a frequency interval within the 64 bins of the FFT of each window.
angle: Angle between two vectors.
Additional vectors were obtained by averaging the signals over each window sample; these are used in the angle variables. In total there are 561 features in the dataset:
In [2]:
!wc -l ./data/features.txt
In [3]:
!cat data/activity_labels.txt
y_train contains one label per line, and X_train contains the 561 features per line.
The whole thing should fit in a pandas DataFrame, as it is only 63 MB:
In [71]:
!ls -lh data/train/
In [7]:
!awk -F' ' '{print NF; exit}' data/train/X_train.txt
There are 561 columns separated by single spaces (' '). Problem: there is an extra space in front of positive numbers (to keep the columns aligned when you look at the text file), which confuses the pandas parser. We use Unix command-line utilities to replace double spaces with single spaces, and to remove the extra spaces before the first column:
In [3]:
!sed 's/^..//' data/train/X_train.txt > data/train/X_train_buffer.txt
!sed -e 's/  / /g' data/train/X_train_buffer.txt > data/train/X_train_clean.txt
!rm data/train/X_train_buffer.txt
In [7]:
!sed 's/^..//' data/test/X_test.txt > data/test/X_test_buffer.txt
!sed -e 's/  / /g' data/test/X_test_buffer.txt > data/test/X_test_clean.txt
!rm data/test/X_test_buffer.txt
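As an aside, pandas can also parse the raw files directly by treating any run of whitespace as a single separator, which would make the sed cleanup unnecessary; a sketch of this alternative:
# Alternative (sketch): let pandas collapse runs of spaces itself.
X_train_alt = pd.read_csv('./data/train/X_train.txt', delim_whitespace=True, header=None)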
In [34]:
X_train = pd.read_csv('./data/train/X_train_clean.txt', sep=' ', engine='c', header=None)
feat_names = flatten(pd.read_csv('data/features.txt', sep=' ', header=None, index_col=0).values.tolist())
X_train.columns = feat_names
X_test = pd.read_csv('./data/test/X_test_clean.txt', sep=' ', engine='c', header=None)
X_test.columns = feat_names
X_train.head(5)
Out[34]:
Now read the labels:
In [35]:
classes = pd.read_csv('data/activity_labels.txt', header=None).values.tolist()
classes = [item for sublist in classes for item in sublist] # flatten list
y_train = pd.read_csv('data/train/y_train.txt', header=None)
y_test = pd.read_csv('data/test/y_test.txt', header=None)
y_train.columns = ["activity"]
y_test.columns = ["activity"]
y_train.head(5)
Out[35]:
In [36]:
X_train.shape
Out[36]:
In [37]:
X_test.shape
Out[37]:
In [38]:
X_train.describe()
Out[38]:
This is signal data from a controlled experiment: very clean, very regular, no missing values.
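A quick check confirms this; a one-line sketch:
print(X_train.isnull().sum().sum())  # 0 expected: no missing values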
In [39]:
y_train.activity.value_counts()
Out[39]:
There is a slight class imbalance: class 6 is over-represented, while classes 2 and 3 are under-represented relative to the average. This could have a negative impact on our classification, but oversampling seems complicated given our features, and undersampling is costly as we only have 7352 samples.
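A cheaper mitigation would be to reweight the loss inversely to the class frequencies rather than resampling; a sketch (the class_weight option is called 'auto' in older scikit-learn releases and was renamed 'balanced' later):
# Sketch: compensate for the slight imbalance by reweighting, instead of resampling.
logit_weighted = linear_model.LogisticRegression(class_weight='auto')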
There are many variables, so we do not even attempt a correlation plot: it would be useless (too much information in a limited space) and a good way to blow up memory. Let's try a PCA to get a synthetic look at the data.
In [40]:
pca = PCA(n_components=10)
pca.fit(X_train)
print(pca.explained_variance_ratio_)
print("Variability of the compressed dataset %.4f" %sum(pca.explained_variance_ratio_))
Keeping three components seems reasonable; let's plot the data along these three components:
In [30]:
plot_PCA(X_train.values, y_train.values, classes, azimuth=59, elevation=-21)
In [31]:
plot_PCA(X_train.values, y_train.values, classes, azimuth=-93, elevation=-7)
There are two big clusters that can be separated by a hyperplane, each composed of three smaller overlapping clusters. The first cluster contains the activities involving walking, while the second contains the activities for which the subject stays still.
According to the first three components of the PCA, each of the three walking activities seems to be well expressed by the data. Laying seems well separated from the other still positions, as gravity is measured on a different axis: the smartphone is horizontal when laying, while it is vertical for the sitting and standing activities. However, sitting and standing seem harder to separate. This is quite intuitive, as they are very similar activities from an accelerometer or gyroscope perspective.
Nonetheless, these figures give us great hope of finding a good classifier for our data (remember we are only using the first three principal components, accounting for approximately 71.6% of the variability).
Having a single component explain 65.5% of the variability also points to strong correlations among the features of X_train.
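We can spot-check this claim numerically without any plotting; a short sketch counting strongly correlated feature pairs (the 0.9 threshold is arbitrary):
corr = np.corrcoef(X_train.values.T)           # 561 x 561 correlation matrix (~2.5 MB)
upper = corr[np.triu_indices_from(corr, k=1)]  # unique off-diagonal pairs
print("Pairs with |r| > 0.9: %d" % np.sum(np.abs(upper) > 0.9))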
We use 5-fold cross-validation. Parameters are selected using the F1 score, as this performance measure avoids giving good scores to naive models in case of class imbalance. The performance of each model is evaluated on the test set (the training set consists of 70% of the data), looking at precision, recall and F1 score. For more information on K-fold cross-validation, cf. The Elements of Statistical Learning (T. Hastie et al.). For more information about the performance measures we use, cf. the scikit-learn documentation on model evaluation.
In [47]:
K_folds = 5
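Concretely, the evaluation protocol amounts to something like the following sketch (using the cross_validation module imported above):
# Sketch of the protocol: mean/std of the F1 score over the 5 folds.
scores = cross_validation.cross_val_score(linear_model.LogisticRegression(),
                                          X_train, np.ravel(y_train),
                                          cv=K_folds, scoring="f1")
print("F1: %.3f +/- %.3f" % (scores.mean(), scores.std()))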
In [43]:
pca = PCA(n_components=50)
X_redux = pca.fit_transform(X_train)
print("\n Variability of the compressed dataset: %.4f of the total variability \n" %sum(pca.explained_variance_ratio_))
logit = linear_model.LogisticRegression()
logit.fit(X_redux, np.ravel(y_train))
y_pred = logit.predict(X_redux) ## No fit_predict method
print("\n performance on train dataset : \n")
print(metrics.classification_report(y_train, y_pred, target_names=classes))
print("\n\n performance on test dataset : \n")
y_pred = logit.predict(pca.transform(X_test))
print(metrics.classification_report(y_test, y_pred, target_names=classes))
In [44]:
logit = linear_model.LogisticRegression()
logit.fit(X_train, np.ravel(y_train))
y_pred = logit.predict(X_train) ## No fit_predict method
print("\n performance on train dataset : \n")
print(metrics.classification_report(y_train, y_pred, target_names=classes))
print("\n\n performance on test dataset : \n")
y_pred = logit.predict(X_test)
print(metrics.classification_report(y_test, y_pred, target_names=classes))
The model performs better without dimensionality reduction. As the time complexity in the number of variables is reasonable for a dataset of this size, we decide to keep the original feature set.
Regarding performance, the sitting and standing activities are harder to separate, as seen in the 3D plots. As sitting has a poor recall and standing a poor precision (compared to the other classes), we can deduce that many samples labeled as sitting are classified as standing. Resolving this issue may be hard.
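A confusion matrix makes this spill-over explicit; a sketch reusing the test predictions from the cell above (rows are true activities, columns predicted ones, in the order of activity_labels.txt):
# The SITTING row should show a noticeable spill-over into the STANDING column.
print(metrics.confusion_matrix(y_test, y_pred))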
There is also some overfitting, so we try an L1-regularized variant of logistic regression (L2 regularization gives very similar results in our case). For more information about regularization and the influence of the parameter $C$, cf. http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression, or The Elements of Statistical Learning (T. Hastie et al.).
In [48]:
logit = linear_model.LogisticRegression(penalty='l1')  # L1 penalty, as discussed above
C = np.logspace(-2, 4, 15)  # exponential exploration of the parameter space
train_scores, valid_scores = validation_curve(logit, X_train, np.ravel(y_train),
                                              "C", C, cv=K_folds, verbose=0,
                                              scoring="f1", n_jobs=-1)
plot_validation_scores(train_scores, valid_scores, C)
The model still exhibits overfitting that cannot be reduced by penalizing the coefficients further without decreasing the performance on the cross-validation samples. The cross-validation score is quite variable (+/- 2%) depending on the 5-fold sampling.
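To make the folds reproducible (and stratified by activity), one could pass an explicit CV iterator instead of the integer K_folds; a sketch using the cross_validation module imported above:
# Sketch: a seeded, stratified split makes the +/- 2% fold-to-fold variability reproducible.
skf = cross_validation.StratifiedKFold(np.ravel(y_train), n_folds=K_folds,
                                       shuffle=True, random_state=42)
# then pass cv=skf (instead of cv=K_folds) to validation_curve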
In [49]:
best_C = C[np.argmax(valid_scores.mean(axis=1))]  # C maximizing the mean cross-validation score
logit = linear_model.LogisticRegression(penalty='l1', C=best_C)
print("Optimal C parameter : %.4f \n" %logit.C)
logit.fit(X_train, np.ravel(y_train))
y_pred = logit.predict(X_train) ## No fit_predict method
print("\n performance on train dataset : \n")
print(metrics.classification_report(y_train, y_pred, target_names=classes))
print("\n\n performance on test dataset : \n")
y_pred = logit.predict(X_test)
print(metrics.classification_report(y_test, y_pred, target_names=classes))
As seen previously, the sitting and standing activities are not well separated, and there is still a large overfit despite the regularization.
We try to improve these results by using an SVM classifier, as the hyperplane found by this model is often better at classification than the one found by logistic regression, thanks to margin maximization (for more detail, cf. The Elements of Statistical Learning, T. Hastie et al.).
As the data is likely to be separable by a hyperplane in the original feature space, there is no need for a nonlinear kernel here. We use LinearSVC (liblinear's implementation of an SVC with a linear kernel). Its complexity is $O(np)$, i.e. this model is less complex than logistic regression, which is $O(np^2)$. For more information on SVMs and the influence of the parameter $C$, cf. http://scikit-learn.org/stable/modules/svm.html, or The Elements of Statistical Learning (T. Hastie et al.).
In [50]:
linSVC = svm.LinearSVC(dual=True, tol=0.0001, multi_class='ovr', fit_intercept=True,
                       intercept_scaling=1, verbose=0, random_state=42)
C = np.logspace(-3, 3, 15)  # exponential exploration of the parameter space
train_scores, valid_scores = validation_curve(linSVC, X_train, np.ravel(y_train),
                                              "C", C, cv=K_folds, verbose=0,
                                              scoring="f1", n_jobs=-1)
plot_validation_scores(train_scores, valid_scores, C)
In [60]:
linSVC.C = C[np.argmax(valid_scores.mean(axis=1))]  # C maximizing the mean cross-validation score
print("Optimal C parameter : %.4f \n" %linSVC.C)
linSVC.fit(X_train, np.ravel(y_train))
y_pred = linSVC.predict(X_train)
print("\n performance on train dataset : \n")
print(metrics.classification_report(y_train, y_pred, target_names=classes))
print("\n\n performance on test dataset : \n")
y_pred = linSVC.predict(X_test)
print(metrics.classification_report(y_test, y_pred, target_names=classes))
This linear SVC performs better than the logit, and is faster to compute. It still has a hard time separating the sitting and standing activities. In order to address this issue, we consider several options below:
In [69]:
plot_learning_curve(linSVC, "Learning Curve (SVC)", X_train, np.ravel(y_train), cv=5, n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 10))
As seen in the learning curves, performance starts to level off beyond roughly 5000 training examples. Obtaining more data would help reduce the variance of our score, but may not be sufficient to improve the model performance dramatically, as we would still have the standing/sitting overlap in our representation.
Trying a nonlinear SVM (e.g. with an RBF kernel) would be very costly in terms of time complexity, as this model is $O(n^3 p)$, and would imply at least one more tuning parameter, making the cross-validation step much harder; see the sketch below.
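For reference, such a search would couple $C$ with the kernel width $\gamma$; a sketch of the grid (shown but deliberately not run here, since each extra parameter multiplies the number of fits):
# Sketch only: an RBF kernel adds gamma to the search, so a 7 x 7 grid
# with 5 folds already means 245 SVM trainings, each O(n^3 p).
rbf_grid = grid_search.GridSearchCV(svm.SVC(kernel='rbf'),
                                    param_grid={'C': np.logspace(-2, 4, 7),
                                                'gamma': np.logspace(-4, 2, 7)},
                                    cv=K_folds, scoring='f1', n_jobs=-1)
# rbf_grid.fit(X_train, np.ravel(y_train))  # left commented out: too costly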
Further work on the signal representation seems a more efficient way to improve the model, as we saw that these two classes overlap seriously under our features (cf. the 3D plots). If we manage to find a representation in which these two signals are easier to separate, this would solve our problem. This could be done by using spherical wavelets rather than the Fourier transform, in a scattering-network fashion, to perform the classification (cf. the work of Stephane Mallat et al., http://www.di.ens.fr/data/software/scatnet/).
Our best model is a linear Support Vector Classifier. This model can be seen as a first step toward a health monitoring application for smartphones or wearables, but there are still some issues to face in order to launch a proper product. Addressing these issues would add complexity to our task; here again, the scattering networks with spherical wavelets discussed above could be of some help.
The model we developed is stable over time, so it should not need to be refreshed regularly. However, we should try to obtain a dataset covering a broader range of individuals, in order to account for the variety of human bodies and movements. As we are using smartphones to collect the data, obtaining more data points would not be very costly in terms of acquisition or storage. If retraining became necessary, it could be done automatically, as the whole parameter-selection step is performed by our code without human intervention. Some checks should be programmed to verify that our exploration of the parameter space is wide enough, and to send alerts if there is a drop in performance; a sketch of one such check follows.
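One such check is to verify that the selected $C$ does not sit on the boundary of the explored grid; a minimal sketch reusing the validation-curve results from above:
# Sketch: alert if the optimum sits on the edge of the grid, which would
# suggest the parameter range explored is too narrow.
best_idx = np.argmax(valid_scores.mean(axis=1))
if best_idx in (0, len(C) - 1):
    print("Warning: optimal C on the grid boundary; widen the search range.")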
As we use an SVM, the model is not very scalable, but as seen previously this is not an issue, since the learning curves show that the gain in performance from additional data would not be very significant.
Such an application could be offered as a smartphone app financed by advertising revenue, or as SaaS (software as a service). In the SaaS model, the app would be accessed through an API, allowing developers to integrate health analytics into their own applications. Pricing would be set according to transaction volume (e.g. the number of API requests).